
[Model] add colqwen2_vl code & inference#14291

Closed
BloomBerry wants to merge 7 commits into vllm-project:main from BloomBerry:colqwen2_vl

Conversation

@BloomBerry

@BloomBerry BloomBerry commented Mar 5, 2025

Add support for ColQwen2VL model
Description
This PR adds support for the ColQwen2VL model to vLLM. ColQwen2VL is an efficient document retrieval vision language model based on Qwen2VL, as described in the paper "ColPali: Efficient Document Retrieval with Vision Language Models". The model is designed to generate embeddings rather than text outputs, making it suitable for document retrieval applications.
Key implementation details:
Extended the existing Qwen2VL implementation for ColQwen2VL compatibility
Implemented custom text projection layer and L2 normalization for embedding generation
Added appropriate processing utilities for image and video inputs
Overrode forward, compute_logits and sample methods to optimize for embedding output
This implementation enables users to leverage ColQwen2VL's multimodal document retrieval capabilities through vLLM's efficient serving infrastructure.
Testing
Tested with sample image inputs
Verified embedding output format and dimensions
Confirmed compatibility with HuggingFace ColQwen2VL models
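The description above mentions a custom text projection layer followed by L2 normalization. As a rough illustration only (hypothetical names and sizes, not the actual code from this PR or vLLM's API), that embedding head might look like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ColQwenEmbeddingHead(nn.Module):
    """Hypothetical sketch of the projection + L2 normalization step
    described in this PR; names and dimensions are illustrative."""

    def __init__(self, hidden_size: int, embed_dim: int = 128):
        super().__init__()
        # Project each token's hidden state into the retrieval space.
        self.proj = nn.Linear(hidden_size, embed_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # L2-normalize per token so that dot products between token
        # embeddings are cosine similarities, as late interaction expects.
        return F.normalize(self.proj(hidden_states), p=2, dim=-1)
```

The key design point is that the model emits one normalized vector per token rather than a single pooled vector, which is what makes ColPali-style retrieval possible.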

FIX #19381

@github-actions

github-actions Bot commented Mar 5, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, a small and essential subset of tests that quickly catches errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify Bot added the documentation Improvements or additions to documentation label Mar 5, 2025
@DarkLight1337
Member

Thanks for implementing this! Can you update the following files as well?

  • Supported Models page
  • Test registry tests/models/test_registry.py
  • Model correctness tests tests/models/embedding/vision_language
  • Processor correctness tests tests/models/multimodal/processing/test_common.py

@DarkLight1337 DarkLight1337 changed the title add colqwen2_vl code & inference [Model] add colqwen2_vl code & inference Mar 5, 2025
Signed-off-by: BloomBerry <jyjang1090@gmail.com>
@mgoin
Member

mgoin commented May 25, 2025

Hey @BloomBerry, I'm working on reviving this PR since it has drifted away from the refactors on main and needs some more testing. Would you like me to push to this PR myself, or should I start a new one?

It seems to require this Transformers PR huggingface/transformers#35778

@mergify
Contributor

mergify Bot commented Jun 4, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @BloomBerry.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 4, 2025
@mgoin mgoin mentioned this pull request Jun 9, 2025
@mergify mergify Bot added the qwen Related to Qwen models label Jun 19, 2025
@issahammoud

Hi, is there an estimation when the PR will be merged?

@mergify mergify Bot added the new-model Requests to new models label Jul 11, 2025
@SMAntony

SMAntony commented Sep 2, 2025

Is anyone working on this?

@issahammoud

I was able to serve ColQwen 2.5 VL 3B (https://huggingface.co/Metric-AI/ColQwen2.5-3b-multilingual-v1.0) with vLLM by making some modifications to the source code.

The idea is to use Qwen 2.5 VL with the ALL pooling type so it outputs all token embedding vectors for late interaction.

Here is a git patch you can apply on vllm source code (tested with v0.11.0).
colqwen.patch

I am using it with the local weights of Metric-AI/ColQwen2.5-3b-multilingual-v1.0 (with the base config from vidore/colqwen2.5-base).

You just need to change the architecture name in the config.json from ColQwen2_5 to ColQwen2_5_VLForConditionalGeneration and add the following modules.json file:

[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  }
]

I am running the openai compatible server in docker compose as follows:

entrypoint: ["vllm", "serve"]
command:
      - "/root/.cache/huggingface/hub/models--Metric-AI--ColQwen2.5-3b-multilingual-v1.0/snapshots/e2a1c05d053dcf4ad6e39b6c48ced9d6a81071f0"
      - "--host"
      - "0.0.0.0"
      - "--port"
      - "8000"
      - "--runner"
      - "pooling"
      - "--convert"
      - "embed"
      - "--dtype"
      - "bfloat16"
      - "--max-model-len"
      - "1024"
      - "--gpu-memory-utilization"
      - "0.8"
      - "--trust-remote-code"
      - "--quantization"
      - "bitsandbytes"
      - "--override-pooler-config"
      - '{"pooling_type":"ALL","normalize":true}'
      - "--served-model-name"
      - "anyname"

It is working well with high throughput on a 8GB GPU. Hope it helps.
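Once the server returns per-token embeddings for a query and a document page, late-interaction scoring in the ColBERT/ColPali style reduces to a MaxSim sum. A minimal sketch, assuming the embeddings are already L2-normalized and have been parsed into NumPy arrays (the function name is illustrative):

```python
import numpy as np


def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """ColPali-style late-interaction score.

    query_emb: (num_query_tokens, dim), doc_emb: (num_doc_tokens, dim),
    both assumed L2-normalized per token.
    """
    # Pairwise cosine similarities between query and document tokens.
    sim = query_emb @ doc_emb.T
    # For each query token, keep its best-matching document token,
    # then sum over query tokens.
    return float(sim.max(axis=1).sum())
```

To rank pages for a query, compute this score against each page's embedding matrix and sort descending.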

@HoangTung-Vu

Does your patch support multimodal (image) embedding ?

@issahammoud

Does your patch support multimodal (image) embedding ?

@HoangTung-Vu Yes indeed.

You should follow the same query structure as colpali-engine:

payload = {
    "model": "my_model_name",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe the image.<|im_end|><|endoftext|>"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_base64}"}}
        ],
    }],
    "encoding_format": "float",
}

resp = requests.post(embedding_url, json=payload)

However, you cannot use openai client code because it does not support multimodal embedding.

@HoangTung-Vu

HoangTung-Vu commented Oct 13, 2025

I already used requests directly instead of the OpenAI client code, but I encountered a 400 Bad Request error.
Did you add any config to the model?

If I comment out the image part, it works:

embedding_url = "http://50.175.95.210:50168/v1/embeddings/"

payload={
    "model": "colqwen",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe the image.<|im_end|><|endoftext|>"},
            # {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}}
        ],
    }],
    "encoding_format": "float",
}

resp = requests.post(embedding_url, json=payload)
print(resp)       

@issahammoud

issahammoud commented Oct 13, 2025

@HoangTung-Vu I need more context to understand why it happened to you.

Could you tell me exactly the steps you did, and the whole message error?

@HoangTung-Vu

I applied your patch using Git commands, but it raised some errors, so I manually integrated the changes instead.
I cloned the vllm repository and applied the modifications on the main branch (currently at version v0.11.0).

For the model, I cloned OpenGVLab/colqwen2_5-3b-base, added the modules.json file as in your implementation, and updated the model class in config.json.

However, when sending a request to the model, I still receive a 400 Bad Request response.

@issahammoud

@HoangTung-Vu Make sure that vLLM is loading the correct model. It happened to me that it loaded a default model because it could not load the local one.
In addition, when cloning vLLM and adding the changes, you need to build it from source so the changes take effect. This step can take a long time (up to multiple hours depending on your config).

I installed the Docker version for my specific hardware, so it was faster.
So I suggest you make sure it is loading your model and not a default one, and confirm that you installed vLLM from source and are using that build.

Here is my docker compose for an RTX 3070:


embedding:
    build:
      context: vllm
      dockerfile: docker/Dockerfile
      target: vllm-openai
      args:
        - max_jobs=8
        - nvcc_threads=2
        - torch_cuda_arch_list=8.6
        - VLLM_USE_PRECOMPILED=1
    environment:
      - DOCKER_BUILDKIT=1
      - CUDA_VISIBLE_DEVICES=0
      - VLLM_HOST=0.0.0.0
      - VLLM_PORT=8000
      - CUDA_HOME=/usr/local/cuda-12.8
      - CUDACXX=/usr/local/cuda-12.8/bin/nvcc
      - LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64
      - HF_HUB_OFFLINE=1
      - TORCH_CUDA_ARCH_LIST=8.6
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - "8000:8000"
    entrypoint: ["vllm", "serve"]
    command:
      - "/root/.cache/huggingface/hub/models--Metric-AI--ColQwen2.5-3b-multilingual-v1.0/snapshots/e2a1c05d053dcf4ad6e39b6c48ced9d6a81071f0"
      - "--host"
      - "0.0.0.0"
      - "--port"
      - "8000"
      - "--runner"
      - "pooling"
      - "--convert"
      - "embed"
      - "--dtype"
      - "bfloat16"
      - "--max-model-len"
      - "1024"
      - "--gpu-memory-utilization"
      - "0.8"
      - "--trust-remote-code"
      - "--quantization"
      - "bitsandbytes"
      - "--override-pooler-config"
      - '{"pooling_type":"ALL","normalize":true}'
      - "--served-model-name"
      - "my-model-name"

    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 60s
      timeout: 300s
      retries: 3
    restart: unless-stopped
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      

@HoangTung-Vu

I ran my tests on a cloud instance from Vast.ai. Since it is a virtual container environment, I was not able to use Docker Compose as in your setup.

For the model (ColQwen), I cloned it directly from Hugging Face. I chose the base model so that I could edit the model_class field in config.json. The fine-tuned variants only include adapter configurations, so they were not suitable for this purpose.

When running vLLM, I pointed directly to the local model directory, so I assume it correctly loaded the intended model.

Regarding vLLM itself, I installed it from source using:

pip install -e .

I suspect that the 400 Bad Request error might be caused by an incorrect configuration of the ColQwen model on my side. I’ll review the model setup again to ensure it matches your patch specifications.

@issahammoud

@HoangTung-Vu
I recommend setting HF_HUB_OFFLINE=1 so it will not try to download another model.
Also check the .cache directory to see if there are models you are not aware of.

@HoangTung-Vu

I have rechecked the configuration and reinstalled everything.
However, with the message template above, it works when the image is provided via a URL, but not when using a base64 string.
Do you know why this might be happening? Thank you very much!

@DarkLight1337
Member

DarkLight1337 commented Oct 15, 2025

What does your base64 URL look like? Make sure it is in the correct format

@issahammoud

@HoangTung-Vu
Check the base64 format; I convert a PIL image as follows:

import io
import base64

buffer = io.BytesIO()
img.save(buffer, format="PNG")
buffer.seek(0)
img_base64 = base64.b64encode(buffer.read()).decode("utf-8")

@github-actions

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions Bot added the stale Over 90 days of inactivity label Jan 16, 2026
@mergify
Contributor

mergify Bot commented Jan 16, 2026

Hi @BloomBerry, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

@github-actions github-actions Bot added unstale Received activity after being labelled stale and removed stale Over 90 days of inactivity labels Jan 18, 2026
@mergify
Contributor

mergify Bot commented Jan 18, 2026

Documentation preview: https://vllm--14291.org.readthedocs.build/en/14291/

@mergify
Contributor

mergify Bot commented Jan 18, 2026

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @BloomBerry.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@TaoZC1996


@issahammoud I successfully ran your patch on vLLM. However, when I call v1/embeddings, whether for text or images, the returned result is always a single 128-dimensional vector. Why is this? Thank you!

@issahammoud

@TaoZC1996
Make sure you set pooling_type to ALL so the model does not return a single pooled vector ('{"pooling_type":"ALL","normalize":true}').
Also verify you are using the same version this patch was developed for (v0.11.0); I think the current version changed some argument names and other things.

@hmellor
Member

hmellor commented Mar 4, 2026

Closing as stale — this has had unresolved merge conflicts and failing checks for a long time. Feel free to open a fresh PR if you'd like to revisit. Thanks for the contribution!

@hmellor hmellor closed this Mar 4, 2026

Labels

  • documentation Improvements or additions to documentation
  • needs-rebase
  • new-model Requests to new models
  • qwen Related to Qwen models
  • unstale Received activity after being labelled stale

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[New Model]: Support ColQwen2VL

8 participants